Abstract


AdaBoost [5] is a well-known ensemble learning algorithm that constructs its
constituent or base models in sequence. A key step in AdaBoost is constructing a
distribution over the training examples to create each base model. This distribution,
represented as a vector, is constructed to be orthogonal to the vector of mistakes
made by the previous base model in the sequence [6]. The idea is to make the next
base model’s errors uncorrelated with those of the previous model. Some researchers
have pointed out the intuition that it is probably better to construct a distribution
that is orthogonal to the mistake vectors of all the previous base models, but that
this is not always possible [6]. We present an algorithm that attempts to come as
close as possible to this goal in an efficient manner. We present experimental
results demonstrating significant improvement over AdaBoost and the Totally
Corrective boosting algorithm [8], which also attempts to satisfy this goal.


1 Introduction 

Ensemble learning algorithms are machine learning algorithms that, given a training
set, construct a combination of base models drawn from a designated hypothesis
class. AdaBoost [5] is one of the most well-known and high-performing ensemble
learning algorithms. It constructs a sequence of base models, where each model is
constructed based on the performance of the previous model on the training set.
In particular, AdaBoost calls the base model learning algorithm with a training set
weighted by a distribution. 1 After the base model is created, it is tested on the
training set to see how well it learned. We assume that the base model learning
algorithm is a weak learning algorithm; that is, its error is less than 0.5, so that
it performs better than random guessing. 2 The weights of the correctly classified
examples and misclassified examples are scaled down and up, respectively, so that
the two groups’ total weights are 0.5 each. The next base model is generated by
calling the learning algorithm with this new weight distribution and the training
set. The idea is that, because of the weak learning assumption, at least some of
the previously misclassified examples will be correctly classified by the new base
model. Previously misclassified examples are more likely to be classified correctly
because of their higher weight, which focuses more attention on them. Kivinen
and Warmuth [6] have shown that AdaBoost scales the distribution with the goal
of making the next base model’s mistakes uncorrelated with those of the previous
base model. It is well-known that ensembles need to have low correlation in their
base models’ errors in order to perform well [1, 9].

1 If the base model learning algorithm cannot take a weighted training set as input,
then we can create a sample with replacement from the original training set according to
the distribution and call the algorithm with that sample.
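The resampling fallback described in footnote 1 can be sketched in a few lines. This
is only an illustration under stated assumptions: the function name
resample_by_distribution, the toy examples, and the weights are hypothetical and do
not come from the paper.

```python
import random


def resample_by_distribution(examples, weights, rng):
    """Draw len(examples) examples with replacement.

    Example i is drawn with probability proportional to weights[i], so a
    learner that ignores weights still sees heavily weighted examples more often.
    """
    return rng.choices(examples, weights=weights, k=len(examples))


# Hypothetical toy training set of (input, label) pairs and a distribution over it.
examples = [("a", 0), ("b", 1), ("c", 0), ("d", 1)]
weights = [0.7, 0.1, 0.1, 0.1]

sample = resample_by_distribution(examples, weights, random.Random(0))
```

The sample has the same size as the original training set, which is one common
choice; the paper does not state the sample size it uses.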

Given this point, we would think, as was pointed out in [6], that AdaBoost would
perform better if the next base model’s mistakes were uncorrelated with those of
all the previous base models instead of just the previous one. It turns out that it
is not always possible to construct a distribution consistent with this requirement.
However, we can attempt to get as close as possible to such a distribution. That is,
we may attempt to find a distribution such that the resulting base model’s mistakes
are as close as possible to being uncorrelated with all the past base models’ mistakes.
Kivinen and Warmuth [6] devised the totally corrective boosting algorithm, which
attempts to do this. However, they do not present any empirical results. Also, they
hypothesize that this algorithm will overfit and, therefore, not perform well. This
paper presents a new algorithm, called Averaging AdaBoost, which has the same
goal as the totally corrective algorithm. In particular, our algorithm calculates the
next base model’s distribution by first calculating a distribution the same way as in
AdaBoost, but then averaging it elementwise with those calculated for the previous
base models. In this way, our algorithm attempts to take all the previous base
models into account in constructing the next model’s distribution. In Section 2, we
review how AdaBoost works and, in particular, how it constructs its distributions
over the training set for each base model. We also describe the totally corrective
algorithm there. In Section 3, we state our algorithm and describe the sense in which
our solution is the best one possible. In Section 4, we present an experimental
comparison of our algorithm with AdaBoost and the totally corrective algorithm.
Section 5 summarizes this paper and describes ongoing and future work.


2 AdaBoost 

Figure 1 shows AdaBoost’s pseudocode. AdaBoost constructs a sequence of base
models $h_t$ for $t \in \{1, 2, \ldots, T\}$, where each one is constructed based on the
performance of the previous base model on the training set. In particular, AdaBoost
maintains a distribution over the $m$ training examples. The distribution $d_1$ used
in creating the first base model gives equal weight to each example ($d_{1,i} = 1/m$
for all $i \in \{1, 2, \ldots, m\}$). The base model learning algorithm $L_b$ is called with the


2 The version of AdaBoost that we use was designed for two-class classification problems.
However, it is routinely used for a larger number of classes when the base model learning
algorithm is strong enough to have an error less than 0.5 in spite of the larger number of
classes.



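AdaBoost($\{(x_1, y_1), \ldots, (x_m, y_m)\}, L_b, T$)
    Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.
    For $t = 1, 2, \ldots, T$:
        $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$.
        Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} d_{t,i}$.
        If $\epsilon_t > 1/2$ then,
            set $T = t - 1$ and abort this loop.
        Calculate distribution $d_{t+1}$:
            $d_{t+1,i} = d_{t,i} \times \begin{cases} \frac{1}{2(1 - \epsilon_t)} & \text{if } h_t(x_i) = y_i \\ \frac{1}{2\epsilon_t} & \text{otherwise.} \end{cases}$
    Output the final hypothesis:
        $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log \frac{1 - \epsilon_t}{\epsilon_t}$.

Figure 1: AdaBoost algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, $L_b$ is the base
model learning algorithm, and $T$ is the maximum allowed number of base models.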


Totally Corrective AdaBoost($\{(x_1, y_1), \ldots, (x_m, y_m)\}, L_b, T$)
    Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.
    For $t = 1, 2, \ldots, T$:
        $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$.
        Calculate the mistake vector $u_t$:
            $u_{t,i} = \begin{cases} 1 & \text{if } h_t(x_i) = y_i \\ -1 & \text{otherwise.} \end{cases}$
        If $d_t \cdot u_t < 0$ then,
            set $T = t - 1$ and abort this loop.
        Calculate distribution $d_{t+1}$:
            Initialize $\hat{d}_1 = d_1$.
            For $j = 1, 2, \ldots$:
                $q_j = \arg\max_{q \in \{1, 2, \ldots, t\}} |\hat{d}_j \cdot u_q|$
                $\hat{\alpha}_j = \ln \frac{1 + \hat{d}_j \cdot u_{q_j}}{1 - \hat{d}_j \cdot u_{q_j}}$
                For all $i \in \{1, 2, \ldots, m\}$,
                    $\hat{d}_{j+1,i} = \frac{1}{\hat{Z}_j} \hat{d}_{j,i} \exp(-\hat{\alpha}_j u_{q_j,i})$,
                    where $\hat{Z}_j = \sum_{i=1}^{m} \hat{d}_{j,i} \exp(-\hat{\alpha}_j u_{q_j,i})$ is the normalizing factor.
    Output the final hypothesis:
        $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \ldots$

Figure 2: Totally Corrective Boosting algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training
set, $L_b$ is the base model learning algorithm, and $T$ is the maximum allowed number of
base models.





training set and $d_1$. 3 The returned model $h_1$ is then tested on the training set to
see how well it learned. Training examples misclassified by the current base model
have their weights increased for the purpose of creating the next base model, while
correctly classified training examples have their weights decreased. More specifically,
if $h_t$ misclassifies the $i$th training example, then its new weight $d_{t+1,i}$ is set
to be its old weight $d_{t,i}$ multiplied by $\frac{1}{2\epsilon_t}$, where $\epsilon_t$ is the sum of the weights of the
examples that $h_t$ misclassifies. AdaBoost assumes that $L_b$ is a weak learner, i.e.,
$\epsilon_t < \frac{1}{2}$ with high probability. Under this assumption, $\frac{1}{2\epsilon_t} > 1$, so the $i$th example’s
weight increases ($d_{t+1,i} > d_{t,i}$). On the other hand, if $h_t$ correctly classifies the $i$th
example, then $d_{t+1,i}$ is set to $d_{t,i}$ multiplied by $\frac{1}{2(1 - \epsilon_t)}$, which is less than one by
the weak learning assumption; therefore, example $i$’s weight is decreased. Under
distribution $d_{t+1}$, the total weight of the examples misclassified by $h_t$ and the total
weight of those correctly classified by $h_t$ become 0.5 each. This is done so that, by the
weak learning assumption, $h_{t+1}$ will classify at least some of the previously
misclassified examples correctly.
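The reweighting step just described can be checked numerically. The following is a
minimal sketch, not the authors' implementation: the function name adaboost_reweight
and the four-example toy data are assumptions made for illustration, and the sketch
requires $0 < \epsilon_t < 1/2$ so that both scaling factors are defined.

```python
def adaboost_reweight(d, correct):
    """Return the next AdaBoost distribution d_{t+1} from d_t.

    d       -- current weights d_t (non-negative, summing to 1)
    correct -- correct[i] is True if h_t classified example i correctly
    """
    # Weighted error epsilon_t: total weight of the misclassified examples.
    eps = sum(w for w, ok in zip(d, correct) if not ok)
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < epsilon_t < 1/2")
    # Correct examples are scaled by 1/(2(1 - eps)), mistakes by 1/(2 eps).
    return [w / (2.0 * (1.0 - eps)) if ok else w / (2.0 * eps)
            for w, ok in zip(d, correct)]


# Hypothetical toy case: four examples, uniform weights, the last one misclassified.
d1 = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
d2 = adaboost_reweight(d1, correct)
```

In this toy case the three correctly classified examples end up sharing a total
weight of 0.5 and the single misclassified example carries the other 0.5.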

For all the base models $h_t$ ($t \in \{1, 2, \ldots, T\}$) and the $m$ training examples,
construct a vector $u_t \in [-1, 1]^m$ such that the $i$th element $u_{t,i} = 1$ if $h_t$ classifies
the $i$th training example correctly ($h_t(x_i) = y_i$) and $u_{t,i} = -1$ otherwise. Kivinen
and Warmuth [6] pointed out that AdaBoost calculates $d_{t+1}$ from $d_t$ such that
$d_{t+1} \cdot u_t = 0$. That is, the new distribution is created to be orthogonal to the
mistake vector of $h_t$, which can be intuitively described as wanting the new base model
to reduce a suitable loss function in a direction orthogonal to what the previous
base model did, so that the new base model’s mistakes are uncorrelated with those
of the previous model. This naturally leads to the question of whether one can
improve upon AdaBoost by constructing $d_{t+1}$ to be orthogonal to the mistake vectors
of all the base hypotheses $h_1, h_2, \ldots, h_t$ (i.e., $d_{t+1} \cdot u_q = 0$ for all $q \in \{1, 2, \ldots, t\}$).
Constructing such a $d_{t+1}$ is not always possible. In particular, if $m > t$, then the
system of equations just given is overspecified, so that there may not be a solution.
Kivinen and Warmuth’s totally corrective algorithm (Figure 2) attempts to solve this
problem using an iterative method. The initial parts of the algorithm are similar to
AdaBoost. That is, the totally corrective algorithm uses the same $d_1$ as AdaBoost
in creating the first base model, and the next statement checks that the base model
error is less than 0.5. The difference is in the method of calculating the weight
distribution for the next base model. The totally corrective algorithm repeatedly finds
the one among the $t$ constraints that is most strongly violated, i.e., finds the value
$q_j$ having the highest value of $|\hat{d}_j \cdot u_{q_j}|$, and then projects the current distribution
$\hat{d}_j$ onto the hyperplane defined by that violated constraint. This is similar to so-called
row action optimization methods [3, 4]. Kivinen and Warmuth show that, if there
is a distribution that satisfies all the constraints, then there is an upper bound of
$\frac{2 \ln m}{\gamma^2}$ on the number of iterations needed so that
$\max_{q_j \in \{1, 2, \ldots, t\}} |\hat{d}_j \cdot u_{q_j}| \leq \gamma$ for
any $\gamma > 0$. Of course, as mentioned earlier, we cannot generally assume that there
is a distribution that satisfies all the constraints, in which case there is no such
bound on the number of iterations. In fact, we are not even guaranteed to reduce
$\max_{q_j \in \{1, 2, \ldots, t\}} |\hat{d}_j \cdot u_{q_j}|$ at each iteration. To make the totally corrective algorithm
usable for our experiments, we have added two stopping criteria not present in the
original algorithm. Define $v_{t,j} = \max_{q_j \in \{1, 2, \ldots, t\}} |\hat{d}_j \cdot u_{q_j}|$. The algorithm stops if

3 As mentioned earlier, if $L_b$ cannot take a weighted training set as input, then we
can give it a sample drawn with replacement from the original training set according to
distribution $d$.



Averaging AdaBoost($\{(x_1, y_1), \ldots, (x_m, y_m)\}, L_b, T$)
    Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.
    For $t = 1, 2, \ldots, T$:
        $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$.
        Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} d_{t,i}$.
        If $\epsilon_t > 1/2$ then,
            set $T = t - 1$ and abort this loop.
        Calculate orthogonal distribution:
            For $i = 1, 2, \ldots, m$:
                $c_{t,i} = d_{t,i} \times \begin{cases} \frac{1}{2(1 - \epsilon_t)} & \text{if } h_t(x_i) = y_i \\ \frac{1}{2\epsilon_t} & \text{otherwise} \end{cases}$
                $d_{t+1,i} = \frac{t \, d_{t,i} + c_{t,i}}{t + 1}$
    Output the final hypothesis:
        $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log \frac{1 - \epsilon_t}{\epsilon_t}$.

Figure 3: Our Averaging AdaBoost algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set,
$L_b$ is the base model learning algorithm, and $T$ is the maximum allowed number of base
models.


either $v_{t,j} - v_{t,j-1} < 0.0001$ or both $j > m$ and $v_{t,j} > v_{t,j-1}$. The first constraint
requires that the maximum dot product decrease by some minimum amount between
consecutive iterations. The second constraint leaves the loop if, after iterating at
least as many times as the number of training examples, the maximum dot product
increases. These are heuristic criteria devised on the basis of observations of some
of our experiments with this algorithm.
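To make the mistake vectors and the notion of a "most strongly violated" constraint
concrete, here is a small sketch. The helper names (mistake_vector, dot,
most_violated) and the toy data are hypothetical; the sketch only measures constraint
violations and deliberately does not implement the projection step of Figure 2.

```python
def mistake_vector(correct):
    """u_t with +1 where the base model was correct and -1 where it erred."""
    return [1 if ok else -1 for ok in correct]


def dot(d, u):
    """Dot product between a distribution d and a mistake vector u."""
    return sum(di * ui for di, ui in zip(d, u))


def most_violated(d, mistake_vectors):
    """Return (q, v): the 0-based index q maximizing |d . u_q| and that value v."""
    violations = [abs(dot(d, u)) for u in mistake_vectors]
    q = max(range(len(violations)), key=violations.__getitem__)
    return q, violations[q]


# Toy case: u1 and u2 are the mistake vectors of two hypothetical base models.
u1 = mistake_vector([True, True, True, False])
u2 = mistake_vector([True, False, True, True])

# The distribution AdaBoost would build from uniform weights after the first model:
# correct examples share weight 0.5, the misclassified example carries 0.5.
d = [1 / 6, 1 / 6, 1 / 6, 1 / 2]

q, v = most_violated(d, [u1, u2])
```

Here d is orthogonal to u1 by construction, but its dot product with u2 is 2/3, so
the second constraint is the one a totally corrective step would select next.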

In the next section, we describe our algorithm. 

3 Our algorithm 

Figure 3 shows our new algorithm. Just as in AdaBoost, our algorithm initializes
$d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$. Then it goes inside the loop, where it calls
the base model learning algorithm $L_b$ with the training set and distribution $d_1$ and
calculates the error of the resulting base model $h_1$. It then calculates $c_1$, which is the
distribution that AdaBoost would use to construct the next base model. However,
our algorithm averages this with $d_1$ to get $d_2$, and uses this $d_2$ instead. The loop
continues for a total of $T$ iterations. The vector $d_{t+1}$ is a running average of the
vectors $c_q$ for $q \in \{1, 2, \ldots, t\}$, which are orthogonal to the mistake vectors of the
previous $t$ base models ($u_q$ for $q \in \{1, 2, \ldots, t\}$), respectively.
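The averaging step can be sketched as follows. This is an illustration under
assumptions rather than the authors' code: it assumes the update takes the form
$d_{t+1,i} = (t \, d_{t,i} + c_{t,i}) / (t + 1)$, where $c_t$ is the distribution
AdaBoost would have used, and the function name averaging_update and the toy data
are hypothetical.

```python
def averaging_update(d, correct, t):
    """One Averaging AdaBoost step; returns (c_t, d_{t+1}).

    d       -- current distribution d_t
    correct -- correct[i] is True if h_t classified example i correctly
    t       -- 1-based index of the current base model
    """
    eps = sum(w for w, ok in zip(d, correct) if not ok)  # weighted error
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < epsilon_t < 1/2")
    # c_t: the distribution plain AdaBoost would use for the next base model.
    c = [w / (2.0 * (1.0 - eps)) if ok else w / (2.0 * eps)
         for w, ok in zip(d, correct)]
    # Assumed running-average form: d_{t+1} = (t * d_t + c_t) / (t + 1).
    d_next = [(t * w + cw) / (t + 1) for w, cw in zip(d, c)]
    return c, d_next


# Two hypothetical rounds on four examples.
d1 = [0.25, 0.25, 0.25, 0.25]
c1, d2 = averaging_update(d1, [True, True, True, False], t=1)
c2, d3 = averaging_update(d2, [True, False, True, True], t=2)
```

Unrolling the assumed recursion gives $d_3 = (d_1 + c_1 + c_2) / 3$, which is the
sense in which the distribution is a running average.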

Table 1: The datasets used in our experiments.

Data Set          Training Set    Test Set    Inputs    Classes
Promoters                   84          22        57          2
Balance                    500         125         4          3
Breast Cancer              559         140         9          2
German Credit              800         200        20          2
Car Evaluation            1382         346         6          4
Chess                     2556         640        36          2
Mushroom                  6499        1625        22          2
Nursery                  10368        2592         8          5
Connect4                 54045       13512        42          3


It is well-known that this $d_{t+1}$ has the least average squared Euclidean distance to the
vectors $c_q$ for $q \in \{1, 2, \ldots, t\}$. In this sense, our algorithm finds a solution that
does the best job of balancing among the $t$ constraints $c_q \cdot u_q = 0$ without the
computational cost of a convex optimization method. It is easy to prove that $d_{t+1}$
is already a distribution (i.e., normalization is unnecessary), but space precludes us
from doing so here.
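For readers who want the omitted argument, the distribution property can be sketched
in two lines. This is not the authors' proof; it assumes that the averaging update has
the form $d_{t+1,i} = (t \, d_{t,i} + c_{t,i}) / (t + 1)$, that $c_t$ is the
AdaBoost-style reweighting of $d_t$, that $0 < \epsilon_t < 1/2$, and, inductively,
that $d_t$ already sums to one.

```latex
\begin{align*}
\sum_{i=1}^{m} c_{t,i}
  &= \frac{1}{2(1-\epsilon_t)} \sum_{i:\,h_t(x_i) = y_i} d_{t,i}
   + \frac{1}{2\epsilon_t} \sum_{i:\,h_t(x_i) \neq y_i} d_{t,i}
   = \frac{1-\epsilon_t}{2(1-\epsilon_t)} + \frac{\epsilon_t}{2\epsilon_t}
   = 1, \\
\sum_{i=1}^{m} d_{t+1,i}
  &= \frac{t \sum_{i=1}^{m} d_{t,i} + \sum_{i=1}^{m} c_{t,i}}{t+1}
   = \frac{t + 1}{t + 1}
   = 1 .
\end{align*}
```

All entries stay non-negative because both scaling factors are positive, and
$d_1$ is uniform, so under these assumptions every $d_{t+1}$ is a distribution
without any renormalization.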

We now demonstrate the experimental usefulness of this algorithm. 

4 Experimental Results 

In this section, we compare AdaBoost, the totally corrective algorithm, and our
averaging algorithm on nine UCI datasets [2] described in Table 1. We ran all
three algorithms with three different values of $T$, which is the maximum number
of base models that the algorithm is allowed to construct: 10, 50, and 100. Each
result reported is the average over 50 results obtained by performing 10 runs of
5-fold cross-validation. Table 1 shows the sizes of the training and test sets for the
cross-validation runs.
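The evaluation protocol (10 runs of 5-fold cross-validation, giving 50 results per
setting) can be sketched as an index generator. This is an assumption-laden
illustration: the paper does not say how folds were shuffled or whether they were
stratified, and the function name repeated_kfold_indices, the seed, and the dataset
size of 100 are hypothetical.

```python
import random


def repeated_kfold_indices(n, k=5, runs=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `runs` shuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for every run
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test


# Hypothetical dataset of 100 examples: 10 runs x 5 folds = 50 train/test splits.
splits = list(repeated_kfold_indices(n=100))
```

Each reported error rate would then be the mean of the 50 test-set errors obtained
from these splits.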

Figure 4 compares the error rates of AdaBoost and our averaging algorithm with
Naive Bayes base models. In all the plots presented in this paper, each point
marks the error rates of two algorithms when run with the number of base models
indicated in the legend and a particular dataset. The diagonal line in each plot
contains the points at which the two algorithms have equal error. Therefore, points
below/above the line correspond to the error of the algorithm indicated on the y-axis
being less than/greater than the error of the algorithm indicated on the x-axis,
respectively. We can see that, for Naive Bayes base models, our averaging algorithm
performs much better than AdaBoost overall. Table 2 shows how often our averaging
algorithm significantly outperformed, performed comparably with, and significantly
underperformed AdaBoost and the Totally Corrective algorithm. In particular,
for 10 base models, averaging significantly outperformed 4 AdaBoost on six of the
datasets, performed comparably on one dataset, and performed significantly worse
on two, which is written as “+6=1-2” in the table. Figure 5 shows that our averaging
algorithm performs substantially better than the Totally Corrective algorithm with
Naive Bayes base models. We examined the runs of the Totally Corrective algorithm
in more detail and often found the overfitting that Kivinen and Warmuth thought
would happen. Due to this poor performance, we did not continue experimenting
with the totally corrective algorithm for the rest of this paper.


4 We use a t-test with $\alpha = 0.05$ to compare all the classifiers in this paper.
Figure 4: AdaBoost vs. Averaged Boosting (Naive Bayes). Axis label: AdaBoost with Naive Bayes.

Figure 5: Totally Corrective Boosting vs. Averaged Boosting (Naive Bayes). Axis label:
Totally Corrective AdaBoost with Naive Bayes.


Table 2: Performance of Averaged Boosting

Compared to           Base Model         10        50        100
AdaBoost              Naive Bayes        +6=1-2    +4=3-2    +4=2-3
Totally Corrective    Naive Bayes        +6=2-1    +6=2-1    +6=2-1
AdaBoost              Decision Trees     +2=7-0    +2=5-2    +2=5-2
AdaBoost              Decision Stumps    +2=6-1    +2=4-3    +2=3-4


We compare AdaBoost and our averaging algorithm using decision tree and decision
stump base models in Figures 6 and 7, respectively. With decision trees, the
averaging algorithm performs somewhat better than AdaBoost. With decision stumps,
the differences in error rates vary much more, with averaging sometimes performing
worse than AdaBoost.

Figure 6: AdaBoost vs. Averaged Boosting (Decision Trees). Axis label: AdaBoost with Decision Trees.

Figure 7: AdaBoost vs. Averaged Boosting (Decision Stumps). Axis label: AdaBoost with Decision Stumps.


5 Conclusions

We presented a boosting algorithm that trains each base model using a training
example weight vector that is based on the performances of all the previous base
models rather than just the previous one. We discussed the theoretical motivation for
this algorithm and presented empirical results that are superior overall to those of
AdaBoost and of the Totally Corrective algorithm, which has the same goal as our
algorithm.

Space precluded a detailed analysis of the performances of the base models and
their correlations, as is often done in a detailed study of ensemble methods. We
plan to do this for a longer version of this paper in order to compare our algorithm
to AdaBoost and the Totally Corrective algorithm in more detail. This analysis
may help to explain why Averaging AdaBoost’s improvement over AdaBoost was
greater for smaller numbers of base models. Additionally, it has been pointed
out [7, 8] that ensembles work best when they are somewhat anti-correlated. We
attempted to exploit this by implementing several boosting algorithms that, at each
iteration, change the base model weights so that the correctly classified examples’
weights add up not to 0.5, but slightly less than 0.5. This scheme occasionally
performed better and occasionally performed worse than AdaBoost. Depending on
the available running time, it may be possible to create classifiers using several of
these weight adjustment schemes and combine all of them or a subset of them in
an ensemble, or perhaps cease using certain weight adjustment schemes if they do
not look promising for the dataset under consideration.